Energy consumption prediction using the GRU-MMattention-LightGBM model with features of Prophet decomposition

The prediction of energy consumption is of great significance to the stability of the regional energy supply. Previous research on energy consumption forecasting has repeatedly proposed improved neural network models or improved machine learning models to predict time series data. Combining a machine learning model and a neural network model that both perform well in energy consumption prediction, we propose a hybrid model architecture, GRU-MMattention-LightGBM, with feature selection based on Prophet decomposition. During the prediction process, the Prophet features are first extracted from the original time series. We select the best LightGBM model on the training set and save its parameters. Then, the Prophet features are input to GRU-MMattention for training. Finally, an MLP is used to learn the final prediction weights between LightGBM and GRU-MMattention. After the prediction weights are learned, the final prediction result is determined. The innovation of this paper is a structure that learns the internal correlation between features by combining Prophet feature extraction with the gating and attention mechanisms. The structure also inherits the strong anti-noise ability of the LightGBM method, which reduces the impact of abrupt changes in energy consumption on the overall prediction effect of the model. In addition, we propose a simple method to select the time window length hyperparameter using ACF and PACF diagrams. The MAPE of the GRU-MMattention-LightGBM model is 1.69%, and the relative error is 8.66% less than that of the GRU structure and 2.02% less than that of the LightGBM prediction. Compared with a single method, the prediction accuracy and stability of this hybrid architecture are significantly improved.


Introduction
Short-term energy consumption prediction estimates and analyzes the future development of the energy system, mainly by constructing a mathematical model that reflects the internal activities and external connections of the energy system. The prediction and accurate control of energy consumption have become important for energy savings and emission reduction. The prediction of short-term energy consumption data can parameter selection, and the model stability and accuracy are not good enough. A single GRU or a single attention mechanism is not as good at predicting periodic time series as a combination of the two. Current research cannot solve the above three key problems simultaneously. Building on simple time series decomposition, we propose a feature selection method based on Prophet decomposition. The gating-attention mechanism can learn the structure of the internal correlation between features. The ensemble learning structure has a strong anti-noise ability and can reduce the impact of mutation points on the overall prediction effect of the model. We propose a hybrid model architecture, GRU-MMattention-LightGBM. This architecture inherits the advantages of each single architecture and has a more stable prediction effect than a single model.

Prophet decomposition
Prophet was developed by Facebook's data science team in 2017 (Taylor S J et al. [15]). It uses a decomposable time series model (Chung et al. [16]) with three main components: trend, seasonality, and residuals. Although the influences on a time series are complex, series fluctuations can generally be decomposed into four parts: trend factors $T_t$, cyclic fluctuations $C_t$, seasonal changes $S_t$, and residuals $R_t$. Harvey A C et al. [17] simplified these four terms into three ($T_t$, $S_t$, $R_t$). In this paper, we combined the characteristics of energy consumption data and made the following simplifications in the time series decomposition model, as given by Eq (1).
In the Prophet model, the trend term $T_t$ contains two parts, namely, a piecewise model based on linear regression and a saturating growth model based on logistic regression. The linear part is shown in Eq (2).
where k represents the growth rate of the model, δ is the change in k, m is the offset parameter, t is the timestamp, α(t) is the indicator function, $\alpha(t)^T$ is the transpose of α(t), and γ is the offset of the smoothing process, whose function is to make the piecewise function continuous. The expression of the saturating growth model based on logistic regression is Eq (3).
where C represents the carrying capacity of the model, k represents the growth rate, and m is the offset parameter. When the rate k is adjusted, the offset parameter must also be adjusted. Next, the piecewise logistic growth model is Eq (4).
The Prophet model can incorporate trend changes by setting change points at which the growth rate changes. Assuming that there are n change points for timestamp t, then $\alpha(t) = (\alpha_1(t), \ldots, \alpha_n(t))^T$ and $\gamma = (\gamma_1, \ldots, \gamma_n)^T$. For a change point at time $s_j$, the offset $\gamma_j$ is set to $-s_j \delta_j$. The generative model for the trend assumes that there are S change points over a history of T points, each with a rate change $\delta_j \sim \mathrm{Laplace}(0, \tau)$.
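The role of the offset γ can be illustrated with a minimal sketch of the piecewise linear trend. It assumes the form used in the Prophet literature, trend(t) = (k + α(t)ᵀδ)·t + (m + α(t)ᵀγ) with γ_j = −s_j δ_j; the function name and the example values are hypothetical and are not taken from the paper.

```python
def piecewise_linear_trend(t, k, m, changepoints, deltas):
    """Evaluate (k + a(t)^T delta) * t + (m + a(t)^T gamma) at time t."""
    # a_j(t) is 1 once t has reached change point s_j, otherwise 0
    a = [1.0 if t >= s else 0.0 for s in changepoints]
    # gamma_j = -s_j * delta_j keeps the trend continuous at each change point
    gammas = [-s * d for s, d in zip(changepoints, deltas)]
    rate = k + sum(a_j * d for a_j, d in zip(a, deltas))
    offset = m + sum(a_j * g for a_j, g in zip(a, gammas))
    return rate * t + offset


# Hypothetical example: base rate 1.0, one change point at t = 2 adding 0.5
before = piecewise_linear_trend(1.0, 1.0, 0.0, [2.0], [0.5])
at_cp = piecewise_linear_trend(2.0, 1.0, 0.0, [2.0], [0.5])
after = piecewise_linear_trend(4.0, 1.0, 0.0, [2.0], [0.5])
```

With these values the slope changes from 1.0 to 1.5 at t = 2, while the trend value at the change point matches the value the old segment would have reached, which is the continuity that γ is meant to enforce.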
The seasonal and residual terms can be represented as Eq (5) and Eq (6). Zarnowitz V et al. [18] pointed out that trend, season, and residual are interrelated rather than independent of each other and established a model of interdependence among the three. Therefore, we propose the following more general assumptions, given as Eq (7), Eq (8) and Eq (9).
$f_1$, $f_2$, $f_3$ are different nonlinear functions, expressed here in implicit form. Therefore, $T_t$, $S_t$, $R_t$ are jointly determined by $T_{t-i}$, $S_{t-i}$, $R_{t-i}$, with $i \in [1, k]$ and $i \in \mathbb{N}$. The structure of the feature based on Prophet decomposition is therefore Eq (10).
F contains the trend, season, and innovation (residual) terms of the previous K periods of the energy consumption sequence. Together, they serve as feature inputs to the predictive model. Finally, $f_1$, $f_2$, $f_3$ are learned and $x_t$ is predicted through the following hybrid architecture.
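The feature construction described above can be sketched as follows. This is a minimal illustration, assuming that the three components have already been produced by a decomposition step and that the target is the additive sum of the components; the function name and the toy numbers are hypothetical.

```python
def build_prophet_lag_features(trend, season, resid, k=3):
    """Return (features, targets) built from the previous k periods.

    Each feature row holds [T_{t-1}, S_{t-1}, R_{t-1}, ..., T_{t-k}, S_{t-k}, R_{t-k}].
    The target is assumed to be the additive sum T_t + S_t + R_t.
    """
    n = len(trend)
    if len(season) != n or len(resid) != n:
        raise ValueError("components must have the same length")
    features, targets = [], []
    for t in range(k, n):
        row = []
        for i in range(1, k + 1):
            row.extend([trend[t - i], season[t - i], resid[t - i]])
        features.append(row)
        targets.append(trend[t] + season[t] + resid[t])
    return features, targets


# Toy components, purely illustrative
X, y = build_prophet_lag_features(
    trend=[1, 2, 3, 4, 5],
    season=[10, 20, 30, 40, 50],
    resid=[0, 0, 0, 0, 0],
    k=2,
)
```

The first k time steps are dropped because they do not have a full history window, so a series of length n yields n − k supervised samples.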

LightGBM
LightGBM is a variant of the gradient boosting decision tree (GBDT) [19]. We use the Prophet method to extract features and transform the time series forecasting problem of electricity consumption into a supervised learning problem, so that the ensemble learner can better learn the nonlinear interaction between trend, season, and residual from the features produced by Prophet. LightGBM is an additive model built with a boosting strategy. During training, the forward stagewise algorithm is used for greedy learning: each iteration learns a CART tree that fits the residual between the prediction of the previous t−1 trees and the true value of the training sample, so each added weak learner improves on the previous ensemble. Similar to XGBoost, LightGBM is explicitly regularized: the first part of the objective is the loss function, and the second part is the regularization term $L_1 + L_2$. The approximate objective function is obtained by a second-order Taylor expansion of the loss function, as shown by Eq (11). GBDT needs to scan all the data to estimate the information gain of all possible split points, which takes considerable time and memory. LightGBM uses the histogram-based decision tree algorithm to reduce the memory footprint and improve the calculation speed. It also uses the gradient-based one-side sampling (GOSS) algorithm, which keeps the samples with large gradients and randomly samples those with small gradients, further saving space and time. The exclusive feature bundling (EFB) algorithm is used to bundle features, which reduces the dimension and time complexity of the algorithm. LightGBM grows trees leaf-wise with a depth constraint. In addition, it supports efficient parallel computing and increases the cache hit rate.
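The forward stagewise idea, in which each new tree fits the residual left by the previous trees, can be illustrated with a deliberately simplified sketch. It uses one-dimensional decision stumps and squared loss, and omits the histogram, GOSS, EFB, and regularization components of LightGBM; all names and data are hypothetical.

```python
def fit_stump(x, residual):
    """Fit a one-split regression stump that minimises squared error."""
    best = None
    for thr in sorted(set(x))[:-1]:  # last value excluded so both sides are non-empty
        left = [r for xi, r in zip(x, residual) if xi <= thr]
        right = [r for xi, r in zip(x, residual) if xi > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left)
        err += sum((r - right_mean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values")
    _, thr, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= thr else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Additive model: start from the mean, then repeatedly fit the residual."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residual)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(tree(xi) for tree in trees)


x_demo = [0, 1, 2, 3, 4, 5]
y_demo = [1, 1, 1, 5, 5, 5]
model_1 = boost(x_demo, y_demo, n_rounds=1)
model_50 = boost(x_demo, y_demo, n_rounds=50)
```

On this toy step function the training error shrinks geometrically with the number of rounds, since each stump removes a fixed fraction (the learning rate) of the remaining residual.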

GRU
Cho K et al. [20] proposed the gated recurrent unit (GRU) in 2014, an improved gating structure designed to better handle the long-term dependency problem. It simplifies the gates of the LSTM by combining the forget gate and the input gate into a single update gate and by merging the cell state and the hidden state, which reduces the complexity of the network unit, reduces the number of parameters, and greatly shortens the training time of the model. Because timeliness matters for the whole system, we forecast the time series with the GRU structure, which consumes less time and achieves a prediction effect comparable to that of LSTM. A schematic diagram of its gate control structure is shown in Fig 1. For each unit in the sequence, σ denotes the sigmoid function and tanh denotes the hyperbolic tangent function. $x_t$ is the input at time t, and $h_{t-1}$ is the hidden state at time t−1, which contains the dependency information of each previous moment. $r_t$ represents the reset gate, and $z_t$ stands for the update gate. Both gates are computed by concatenating the input of the current moment with the hidden state of the previous moment and passing the result through the sigmoid function, which constrains the output to [0, 1]. The output is inhibited as it approaches 0 and activated as it approaches 1.
First, the reset gate and update gate formulas are Eq (12) and Eq (13).
Then, the reset gate is used to reset the information, and the data are scaled to the range [-1, 1] by the tanh function to obtain $n_t$. $n_t$ contains the information to be added at the current moment, which is equivalent to memorizing the state of the current moment, as given by Eq (14).
The last stage outputs the final hidden information. The function of this step is to forget part of the information passed down from earlier moments and to add part of the information input by the current node. The output $y_t$ at the current moment is then computed from the hidden information, as given by Eq (15) and Eq (16).
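The gate computations above can be sketched as a single GRU step in NumPy. This is an illustration under assumptions: it follows the parameterization commonly used in deep learning libraries (separate input and hidden weight matrices, which is equivalent to one matrix applied to the concatenated vector), the parameter names are hypothetical, and the output mapping from $h_t$ to $y_t$ is left out because its exact form is not recoverable from the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, p):
    """One GRU step; p is a dict of weight matrices and bias vectors."""
    # Reset gate r_t and update gate z_t, both squashed into (0, 1)
    r_t = sigmoid(p["W_ir"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])
    z_t = sigmoid(p["W_iz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])
    # Candidate state n_t in (-1, 1); the reset gate scales the past information
    n_t = np.tanh(p["W_in"] @ x_t + p["b_in"] + r_t * (p["W_hn"] @ h_prev + p["b_hn"]))
    # New hidden state: keep a fraction z_t of the past, add (1 - z_t) of the candidate
    return (1.0 - z_t) * n_t + z_t * h_prev


# Sanity check with all-zero parameters (input size 3, hidden size 2)
zero_params = {
    "W_ir": np.zeros((2, 3)), "W_hr": np.zeros((2, 2)), "b_r": np.zeros(2),
    "W_iz": np.zeros((2, 3)), "W_hz": np.zeros((2, 2)), "b_z": np.zeros(2),
    "W_in": np.zeros((2, 3)), "W_hn": np.zeros((2, 2)),
    "b_in": np.zeros(2), "b_hn": np.zeros(2),
}
h_next = gru_step(np.ones(3), np.array([0.4, -0.8]), zero_params)
```

With zero parameters both gates equal 0.5 and the candidate state is 0, so the new hidden state is exactly half of the previous one, which makes the interpolation role of the update gate easy to see.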

MMattention
Vaswani et al. [21] proposed the multihead attention mechanism, which uses different heads for different representation subspaces within the self-attention structure. Multihead attention enables the model to jointly attend to information from different representation subspaces at different positions, which helps capture the intraday variation of energy consumption more effectively. In this paper, a multihead self-attention structure is proposed for the energy consumption prediction problem, as shown in Fig 2. The input $x_i$ is the vector of decomposition features for hour i of the day; since there are 24 hours in a day, the input of the masked multihead attention block is a vector sequence of length M = 24. The sequence of vectors can be represented as an input matrix I, given by Eq (17).
2.4.1 Self-attention mechanism. The essence of a self-attention function can be described as mapping a query and a set of key-value pairs to an output, where query, key, value, and output are all vectors. Each feature of the input has a set of vectors consisting of q, k and v. Q, K and V are the concatenation of all vector sequences of q, k and v, respectively. More intuitively, the attention mechanism is an operation that computes the similarity between a query and a key and extracts the query-related values for a weighted sum, as given by Eq (18), Eq (19), and Eq (20).
$W^q$, $W^k$, and $W^v$ are all matrices that are updated through iterative training. The dimension of the Q and K matrices corresponding to the input vector is $d_k$, and the dimension of the V matrix is $d_v$. In the attention matrix, the larger the value of an element is, the stronger the interaction between energy consumption in the corresponding periods. The correlation matrix is given by Eq (21).
The output of the self-attention layer is the weighted sum of the values, where the weight of each value is the inner product of the corresponding query and key divided by $\sqrt{d_k}$ so that the inner product does not become too large, as given by Eq (22).

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$, with $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$,
where the projections $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are parameter matrices. The difference between energy prediction and the text translation task is that translation can use the later context of the input when producing earlier parts of the output, whereas energy consumption forecasting can only predict the future based on past energy consumption. We therefore propose a masked multihead attention structure. This masking ensures that the predicted value at time i can only rely on known outputs at times earlier than i. The elements of the correlation matrix can be expressed as Eq (25).
C is a lower triangular matrix, which ensures that the network does not see future information when making predictions, so the reported accuracy is not inflated by information leakage.
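The effect of the mask can be checked with a small single-head sketch in NumPy. It assumes that the rows of the input matrix are the M = 24 hourly feature vectors and that the weights are normalized with a softmax; the function name, dimensions, and random data are hypothetical, and the multihead concatenation is omitted for brevity.

```python
import numpy as np


def masked_self_attention(I, W_q, W_k, W_v):
    """Single-head scaled dot-product attention with a causal (lower-triangular) mask."""
    Q, K, V = I @ W_q, I @ W_k, I @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    M = scores.shape[0]
    # Entries above the diagonal correspond to future positions and are masked out
    future = np.triu(np.ones((M, M), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, shifted by the row maximum for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
I_demo = rng.normal(size=(24, 6))  # 24 hours, 6 hypothetical features per hour
W_q, W_k, W_v = (rng.normal(size=(6, 4)) for _ in range(3))
out, C = masked_self_attention(I_demo, W_q, W_k, W_v)

# Changing only the last hour must not change the outputs of earlier hours
I_changed = I_demo.copy()
I_changed[-1] += 1.0
out_changed, _ = masked_self_attention(I_changed, W_q, W_k, W_v)
```

The weight matrix returned here is lower triangular with rows summing to 1, and perturbing a later hour leaves all earlier outputs unchanged, which is the property the mask is intended to guarantee.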

GRU-MMattention hybrid model architecture
As shown in Fig 3, to enable the network model to jointly pay attention to different representation subspace information at different locations, that is, to learn the interaction relationship between electricity consumption in different periods within a day, we added a masked-multihead attention layer. Due to the large network depth, the Add&Norm layer is adopted to improve the prediction effect and speed up the network convergence. Without layer normalization, the gradient descent process is slow, and the descent trajectory fluctuates greatly. Then, the dimensionality is reduced with a feed-forward layer. Next, the data pass through the Add&Norm layer, and finally, the final result is output through the Linear layer.

GRU-MMattention-LightGBM model prediction process
Hybrid model training mainly includes three steps: feature construction, model training and prediction. The process is shown in Fig 4. Step 1, feature extraction on the original data using Prophet. After cleaning the original data, the time series of energy consumption is decomposed into Trend, Seasonality and Residual according to the Prophet decomposition method in section 2.1. Then, features are built from the decomposed components as in Eq (10).
Step 2, model training. The LightGBM model is trained on the training set. We pick the model with the best prediction performance on the validation set. After that, the parameters of the LightGBM model are frozen and saved. Then, the hyperparameters of GRU-MMattention are set, and the network is trained on the training set. The results obtained by the neural network and the results of LightGBM on the training set are used together as the input of the multilayer perceptron (MLP). The weights of both models are learned by the MLP. Finally, only the parameters of the neural network are iteratively adjusted according to the results on the validation set, and the best hybrid model is selected.
Step 3, prediction. The ensemble model trained in step 2 is saved, and prediction is performed on the test set.
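The weighting step in Step 2 can be illustrated with a simplified stand-in. The paper learns the weights with an MLP; the sketch below instead fits a two-weight linear blend by ordinary least squares, which is a different and much simpler technique, used here only to show how two prediction vectors are combined. All names and numbers are hypothetical.

```python
import numpy as np


def fit_blend_weights(pred_lgb, pred_nn, y_true):
    """Least-squares weights w such that w[0]*pred_lgb + w[1]*pred_nn approximates y_true."""
    X = np.column_stack([pred_lgb, pred_nn]).astype(float)
    w, *_ = np.linalg.lstsq(X, np.asarray(y_true, dtype=float), rcond=None)
    return w


def blend(pred_lgb, pred_nn, w):
    """Combine the two model outputs with the learned weights."""
    return w[0] * np.asarray(pred_lgb, dtype=float) + w[1] * np.asarray(pred_nn, dtype=float)


# Synthetic check: the target is an exact 0.3 / 0.7 mix of the two predictions
lgb_demo = np.array([1.0, 2.0, 3.0, 4.0])
nn_demo = np.array([2.0, 1.0, 4.0, 3.0])
target_demo = 0.3 * lgb_demo + 0.7 * nn_demo
w_demo = fit_blend_weights(lgb_demo, nn_demo, target_demo)
```

In this synthetic case the fit recovers the mixing weights exactly; an MLP, as used in the paper, can additionally learn a nonlinear combination.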

Data processing process
We selected the energy consumption data of the US PJM regional energy supply company in 14 regions from the Kaggle data website for research (https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption). The energy consumption data from 0:00 on January 1, 2015, to 23:00 on August 3, 2018, were selected, with a frequency of 1 hour and a total of 31440 samples. We performed k-nearest-neighbor imputation for the 4 missing values and removed 2 duplicate values. The first 85% of the data are the training set, the next 5% are the validation set, and the last 10% are the test set. Fig 5 shows that the hourly electricity consumption data have a significant intraday cycle, and the daily electricity consumption data have a significant seasonal effect. Combined with the cutoff property of the PACF, the hyperparameter for the length of the feature window K is chosen as 3. In addition, in a preliminary grid search over K ∈ [1, 6] on LightGBM and GRU, K = 3 performs best on the test set. Therefore, the window length K for feature selection is 3. It can be seen from subgraphs 1 and 2 of Fig 6 that the energy consumption cycle is decomposed very neatly, and the feature extraction is very effective.
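The 85% / 5% / 10% split can be sketched as a chronological cut, assuming that the samples are already sorted by time and that no shuffling is applied; the function name is hypothetical. Integer arithmetic is used for the boundaries to avoid floating-point rounding.

```python
def chronological_split(series, train_pct=85, val_pct=5):
    """Split a time-ordered sequence into train, validation, and test parts."""
    n = len(series)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]  # the remainder, here the last 10%
    return train, val, test


# With 31440 hourly samples, as in the data set described above
train_part, val_part, test_part = chronological_split(list(range(31440)))
```

Keeping the order intact means the validation and test sets always lie after the training period, which matches the forecasting setting.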

Feature selection and normalization
To improve the convergence speed of the neural network, when the neural network model is established, minimum-maximum value normalization is performed on the features of each dimension in all samples so that the original data are in the range of [0, 1]. The normalized equation is Eq (26).
where $x^{*}_{ij}$ represents the normalized data, $x_{ij}$ represents the original data, N is the number of samples, $\min_{1 \le i \le N} x_{ij}$ is the minimum value of the same feature dimension over all samples, and $\max_{1 \le i \le N} x_{ij}$ is the maximum value of the same feature dimension over all samples.
Since the normalized data prediction is not the real predicted value, it is necessary to save the conversion factor to facilitate denormalization after the prediction and obtain the actual predicted value. The denormalization method is shown in Eq (27).
$y$ represents the final predicted value after denormalization, $\hat{y}'$ represents the model prediction under normalized data training, and $y_{max}$ and $y_{min}$ represent the maximum and minimum values of the labels in the training set.
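The normalization and its inverse can be sketched as below, assuming the usual min-max form x* = (x − min) / (max − min) and the inverse y = ŷ′ · (y_max − y_min) + y_min; the function names and numbers are hypothetical.

```python
def fit_min_max(column):
    """Return the conversion factors (min, max) that must be saved for later."""
    return min(column), max(column)


def normalize(column, lo, hi):
    """Scale values into [0, 1] using the saved minimum and maximum."""
    span = hi - lo
    if span == 0:
        raise ValueError("constant column cannot be min-max normalized")
    return [(x - lo) / span for x in column]


def denormalize(pred, y_min, y_max):
    """Map normalized predictions back to the original scale."""
    return [p * (y_max - y_min) + y_min for p in pred]


labels = [10.0, 20.0, 30.0]
y_min, y_max = fit_min_max(labels)
scaled = normalize(labels, y_min, y_max)
restored = denormalize(scaled, y_min, y_max)
```

Saving the minimum and maximum once and reusing them for every later transformation keeps the prediction scale consistent between training and evaluation.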

Evaluation indicators
This article addresses a time series forecasting problem. The labels are numerical, and we are mainly concerned with the gap between the actual and predicted values. Therefore, MAPE, MAE, MSE and RMSE are selected to measure the prediction accuracy and generalization ability of different models. $y_i$ represents the true value and $\hat{y}_i$ the predicted value, i = 1, 2, ..., N. Eq (28), Eq (29), and Eq (30) are as follows: $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$
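For reference, the four metrics can be computed as in the following sketch, which uses their standard textbook definitions; MAPE is returned as a fraction rather than a percentage, and the sample values are hypothetical.

```python
import math


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root mean squared error, the square root of MSE."""
    return math.sqrt(mse(y, y_hat))


def mape(y, y_hat):
    """Mean absolute percentage error as a fraction; true values must be nonzero."""
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


y_true_demo = [100.0, 200.0]
y_pred_demo = [110.0, 180.0]
```

MAPE is scale free, which makes it convenient for comparing models across regions with different consumption levels, while MAE and RMSE stay in the units of the original series.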

GRU.
We used a grid search to adjust the hyperparameters in the preexperiment, with the hidden size chosen from {16, 32, 64, 128} and the number of layers from {1, 2, 3, 4}. The configuration with the lowest MAPE and MAE is selected for comparison, as shown in Table 1.
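The grid search over the two hyperparameters can be sketched as follows. The `evaluate` callback is a hypothetical placeholder that is assumed to train a model with the given configuration and return a single validation score such as MAPE; the paper selects on both MAPE and MAE, which is simplified to one score here.

```python
from itertools import product

HIDDEN_SIZES = (16, 32, 64, 128)
NUM_LAYERS = (1, 2, 3, 4)


def grid_search(evaluate, hidden_sizes=HIDDEN_SIZES, num_layers=NUM_LAYERS):
    """Try every (hidden size, layer count) pair and keep the lowest score."""
    best_cfg, best_score = None, float("inf")
    for hidden, layers in product(hidden_sizes, num_layers):
        score = evaluate(hidden, layers)
        if score < best_score:
            best_cfg, best_score = (hidden, layers), score
    return best_cfg, best_score


# Stand-in scoring function, used only to exercise the search loop
def dummy_score(hidden, layers):
    return abs(hidden - 64) + abs(layers - 2)


best_cfg, best_score = grid_search(dummy_score)
```

The grid contains 4 × 4 = 16 configurations, so an exhaustive search remains cheap compared with training time per model.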

GRU-MMattention.
Due to the logical requirements of time series prediction, we use the masked multihead attention architecture with the number of heads set to 2. To capture the interaction between different periods within a day, we also add positional encoding to the architecture. Likewise, GRU-MMattention uses grid search to adjust the hyperparameters in the preexperiment, with the hidden size chosen from {16, 32, 64, 128}. The number of

LightGBM.
Jiang X et al. [22] denoted LightGBM using moving temporal window features as TFLightGBM. We denote the LightGBM model with Prophet decomposition features as PlightGBM. We set the number of leaves to 31, the maximum depth to 5, the learning rate to 0.01, the number of estimators to 100, the minimum child weight to 0.01, and the minimum number of samples per child to 20.
As Table 3 shows, PlightGBM is much more accurate than TFLightGBM, which shows that Prophet decomposition contributes significantly to feature construction. The PlightGBM prediction results are shown in Fig 8.
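The settings listed above could be expressed as a parameter dictionary for LightGBM's scikit-learn style regressor. The mapping of "minimum subtree weight" to `min_child_weight` and of "minimum subtree sample" to `min_child_samples` is an assumption, and the helper name is hypothetical; the library import is deferred so that the dictionary can be inspected without LightGBM installed.

```python
# Hyperparameters as reported in the text; the parameter names are an assumed mapping
PLIGHTGBM_PARAMS = {
    "num_leaves": 31,
    "max_depth": 5,
    "learning_rate": 0.01,
    "n_estimators": 100,
    "min_child_weight": 0.01,
    "min_child_samples": 20,
}


def make_plightgbm():
    """Build a regressor with the settings above (requires the lightgbm package)."""
    from lightgbm import LGBMRegressor  # imported lazily on purpose
    return LGBMRegressor(**PLIGHTGBM_PARAMS)
```

A depth limit of 5 allows at most 2^5 = 32 leaves per tree, so the leaf count of 31 is consistent with the depth constraint.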

Model comparison results
3.6.1 Comparison of the methods' accuracy. Based on the results in 3.5, we compare the optimal models in each structure, as shown in Table 4 and Fig 10.
3.6.2 Comparison of the anti-noise ability of the methods. In this paper, strong anti-noise ability refers to the robustness of the model to noise in the training set. Specifically, if the model is trained on data contaminated by noise and its predictions still differ little from those obtained without noise, we conclude that the model has good anti-noise ability. We selected energy consumption data from PJM regional power supply companies in areas different from those used previously. The power consumption data from 0:00 on January 1, 2017, to 23:00 on August 3, 2018, were selected, with a frequency of 1 hour and a total of 11380 samples. After data cleaning, we added noise to 10% of the data in the training set, Noise ~ N(0, 1000), where 1000 is the standard deviation of the power consumption data itself. Finally, the model is retrained, and metrics are calculated on the test set. The experiment was repeated 10 times, and the metrics were recorded each time. If a t test on the 10 experimental results shows that the metrics did not change significantly, the GRU-MMatten-LGB method can be considered to have strong anti-noise ability. Table 7 shows that the above p values are all greater than 0.1, so there is no reason to reject the null hypothesis. Therefore, we conclude that the distribution of MAPE in the noise experiment conforms to a normal distribution. Then, we can check whether the MAPE mean of the noise experiment is smaller than the given population mean, which is denoted as $\mathrm{MAPE}_{normal}$ (Table 6).
$\mathrm{MAPE}_{normal}$ represents the MAPE of a given model without noise, and the data are shown in Table 5. The results of the t test are shown in Table 8. According to the results of Tables 6 and 8, the p value of the t test in the GRU noise experiment is less than 0.1. The mean MAPE of the GRU model in 10 noise experiments is 0.03327, which can be considered significantly higher than the value of 0.028152 obtained when noise is not added. Therefore, we conclude that the GRU model has poor anti-noise ability. Similarly, the p value of the t test in the GRU-MMattention noise experiment is less than 0.1, and we also conclude that GRU-MMattention has poor anti-noise ability.
However, for the LightGBM method in the noise experiment, the p value of the t test is greater than 0.1. The mean MAPE of LightGBM in 10 noise experiments is 0.02720, which cannot be judged to be significantly greater than the value of 0.026851 obtained without noise. The impact of noise on the prediction results of the LightGBM model is therefore not significant, and the LightGBM method has a strong anti-noise ability. Similarly, the p value of the t test on GRU-MMatten-LGB is also greater than 0.1, so we conclude that GRU-MMatten-LGB has a strong anti-noise ability.
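The decision rule of the noise experiment can be sketched as a one-sample, one-sided t test. This is an illustration under assumptions: the test is taken to be one-sided at the 0.1 level with 10 runs (9 degrees of freedom, critical value 1.383 from a standard t table), and the per-run MAPE values below are invented for demonstration; only the two reference means 0.028152 and 0.026851 come from the text.

```python
import math
import statistics

# One-sided critical value of the t distribution for alpha = 0.1 and 9 degrees of freedom
T_CRIT_ALPHA_10_DF_9 = 1.383


def one_sample_t(sample, population_mean):
    """t statistic of the sample mean against a fixed reference mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    return (mean - population_mean) / (sd / math.sqrt(n))


def degrades_under_noise(noisy_mapes, mape_normal, t_crit=T_CRIT_ALPHA_10_DF_9):
    """True if the noisy MAPE is significantly greater than the noise-free MAPE."""
    return one_sample_t(noisy_mapes, mape_normal) > t_crit


# Invented per-run MAPE values, for illustration only
sensitive_runs = [0.032, 0.034, 0.033, 0.035, 0.031, 0.033, 0.034, 0.032, 0.033, 0.034]
robust_runs = [0.025, 0.029, 0.027, 0.028, 0.026, 0.030, 0.024, 0.027, 0.028, 0.026]

sensitive_flag = degrades_under_noise(sensitive_runs, 0.028152)
robust_flag = degrades_under_noise(robust_runs, 0.026851)
```

In this illustration the first set of runs is flagged as significantly worse under noise and the second is not, mirroring the contrast the text draws between GRU and LightGBM.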

Conclusion
Based on PJM District Energy Company's energy consumption data in 14 districts from January 1, 2015, to August 3, 2018, our conclusions are as follows: